
Restores change to the Gemini system prompt#8

Merged
andrew-goldstein merged 1 commit into fix/langgraph-esql from fix/langgraph-esql-restore-gemini-prompt-update on Aug 2, 2024

@andrew-goldstein
Collaborator

Restores a change to the Gemini system prompt, to enable tool calling when the user clears the default system prompt.

@andrew-goldstein andrew-goldstein merged commit 30264c9 into fix/langgraph-esql Aug 2, 2024
@andrew-goldstein andrew-goldstein deleted the fix/langgraph-esql-restore-gemini-prompt-update branch August 2, 2024 22:20
patrykkopycinski pushed a commit that referenced this pull request Sep 23, 2024
fixes
[#8](elastic/observability-accessibility#8)
fixes
[#7](elastic/observability-accessibility#7)
 
## Summary

Fixes APM breadcrumbs on serverless

| Serverless  |  Stateful  |
|---|---|
| <img width="700px" alt="image" src="https://github.com/user-attachments/assets/944a7d58-7de3-4a7f-be02-3c8c1110a0e2"> | <img width="800px" alt="image" src="https://github.com/user-attachments/assets/450664b1-ddfc-4395-9fa3-a7b941affb3b"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/944a7d58-7de3-4a7f-be02-3c8c1110a0e2"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/450664b1-ddfc-4395-9fa3-a7b941affb3b"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/944a7d58-7de3-4a7f-be02-3c8c1110a0e2"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/cb8a39e2-ca33-4cf9-a8ac-4c84566d092d"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/151a3a9c-c81e-4558-9d00-e695e3d1d79c"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/2562e96f-d5e4-4aa4-a221-6721f8995883"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/8d877d11-8c3f-4ac5-8146-6a11125eae7c"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/36e588cb-4c18-4d66-a2c6-f0e66392f708"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/14253196-06de-4343-811f-61aa31ea0d1e"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/0cdfc83f-6545-433f-8c14-5bbf2a581175"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/89a58e2b-2cef-4188-b2be-f359ba6890db"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/f15e767f-5b60-4485-ac71-7b6fd850ec50"> |
| <img width="500px" alt="image" src="https://github.com/user-attachments/assets/a0f7bfae-bfda-4f49-b92a-e736d80fea4c"> | <img width="500px" alt="image" src="https://github.com/user-attachments/assets/680db8ab-58b8-454b-a0d7-6e1681dbe616"> |


### How to test
#### Serverless
- Start a local ES serverless instance: `yarn es serverless --projectType=oblt --ssl -k/--insecure`
- Start a local Kibana serverless instance: `yarn start --serverless=oblt --no-ssl`
- Run some synthtrace scenarios:
  - `NODE_TLS_REJECT_UNAUTHORIZED=0 node scripts/synthtrace mobile.ts --live --target=https://elastic_serverless:changeme@127.0.0.1:9200 --kibana=http://elastic_serverless:changeme@0.0.0.0:5601`
  - `NODE_TLS_REJECT_UNAUTHORIZED=0 node scripts/synthtrace service_map.ts --live --target=https://elastic_serverless:changeme@127.0.0.1:9200 --kibana=http://elastic_serverless:changeme@0.0.0.0:5601`
- Navigate to Applications and click through the links

#### Stateful
- Start a local ES and Kibana instance
- Run some synthtrace scenarios:
  -  `node scripts/synthtrace mobile.ts --live`
  -  `node scripts/synthtrace service_map.ts --live`
- Navigate to Applications and click through the links

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
patrykkopycinski added a commit that referenced this pull request Mar 20, 2026
This establishes the structure for advanced evaluation capabilities
ported from cursor-plugin-evals and serves as the home for Phases 3-5
of the evals roadmap.

## Architecture

The package is designed to be completely independent from @kbn/evals:

```
Evaluation Suites
     ├──> @kbn/evals (core)
     └──> @kbn/evals-extensions (advanced features)
              └──> depends on @kbn/evals
```

**Dependency Rule:**
- ✅ kbn-evals-extensions CAN import from kbn-evals
- ❌ kbn-evals MUST NOT import from kbn-evals-extensions
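
The dependency rule above can be expressed as a small allow-list check. This is only an illustrative sketch: `ALLOWED_IMPORTS` and `isImportAllowed` are hypothetical names and are not part of either package or of any Kibana lint rule.

```typescript
// Hypothetical sketch of the one-way dependency rule; not an actual Kibana check.
type PackageName = '@kbn/evals' | '@kbn/evals-extensions';

// For each package, the other packages it is allowed to import from.
const ALLOWED_IMPORTS: Record<PackageName, readonly PackageName[]> = {
  '@kbn/evals': [], // core must stay independent of the extensions
  '@kbn/evals-extensions': ['@kbn/evals'], // extensions may build on core
};

function isImportAllowed(from: PackageName, to: PackageName): boolean {
  // A package can always import from itself.
  if (from === to) return true;
  return ALLOWED_IMPORTS[from].indexOf(to) !== -1;
}
```

Under these assumptions, `isImportAllowed('@kbn/evals-extensions', '@kbn/evals')` is accepted while the reverse direction is rejected.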

## This PR

**What's included:**
- Package structure (package.json, kibana.jsonc, tsconfig.json)
- Placeholder exports (no functional changes)
- Test infrastructure (5 passing tests)
- Comprehensive documentation

**What's NOT included:**
- No functional features (placeholder exports only)
- No changes to @kbn/evals package
- No changes to evaluation suite behavior

## Validation

✅ Bootstrap completed successfully
✅ Type check passed
✅ All tests passing (5/5)
✅ ESLint passed
✅ No circular dependencies
✅ check_changes.ts passed

## Roadmap

This foundation enables parallel development of:
- PR #2: Cost tracking & metadata enrichment
- PR #3: Dataset management utilities
- PR #4: Safety evaluators (toxicity, PII, bias, etc.)
- PR #5: UI components (run comparison, example explorer)
- PR #6: DX enhancements (watch mode, caching, parallel)
- PR #7: Advanced analytics
- PR #8: A/B testing & active learning
- PR #9: Human-in-the-loop workflows
- PR #10: IDE integration

## Related Issues

- Closes part of elastic#257821 (Epic: Extend @kbn/evals)
- Enables elastic#257823 (Phase 2: CI Quality Gates)
- Enables elastic#257824 (Phase 3: Red-Teaming)
- Enables elastic#257825 (Phase 4: Lens Dashboards)
- Enables elastic#257826 (Phase 5: Auto-Generation)
- Addresses elastic#255820 (kbn/evals <-> Agent Builder completeness)

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
patrykkopycinski added a commit that referenced this pull request Mar 21, 2026
…, extract constants

CRITICAL #2: Delete semantic_dedup_elser.ts (always returned null), document Phase 2
HIGH #4: Create fetchAlertsByIds() utility - eliminates 45 lines of duplication
HIGH #6: Add fail-fast bulk error handling - throws on >50% failures, warns on >10%
MEDIUM #8: Extract PIPELINE_LIMITS constants - single source of truth
MEDIUM #9: Verified no emoji in logs (already clean)

Progress: 5/12 deep review issues fixed
Tests: 62/62 passing
Types: No errors
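
The fail-fast thresholds from HIGH #6 could look roughly like the sketch below. The item shape, the constant names, and `checkBulkFailures` are assumptions made for illustration; the commit itself does not show the implementation.

```typescript
// Hypothetical, simplified view of one item in a bulk response.
interface BulkItemResult {
  error?: unknown;
}

const FAIL_THRESHOLD = 0.5; // throw when more than 50% of items failed
const WARN_THRESHOLD = 0.1; // warn when more than 10% of items failed

interface BulkOutcome {
  level: 'ok' | 'warn';
  failed: number;
  total: number;
}

function checkBulkFailures(items: BulkItemResult[]): BulkOutcome {
  const total = items.length;
  const failed = items.filter((item) => item.error !== undefined).length;
  const ratio = total === 0 ? 0 : failed / total;

  if (ratio > FAIL_THRESHOLD) {
    // Fail fast: the batch is too broken to continue with.
    throw new Error(`Bulk operation failed: ${failed}/${total} items errored`);
  }
  // The caller decides how to log a 'warn' outcome.
  return { level: ratio > WARN_THRESHOLD ? 'warn' : 'ok', failed, total };
}
```

Both comparisons are strict here, so a batch with exactly 10% failures is still reported as `ok`; whether the real code treats the boundary the same way is not stated in the commit.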
patrykkopycinski added a commit that referenced this pull request Mar 27, 2026
…, extract constants

CRITICAL #2: Delete semantic_dedup_elser.ts (always returned null), document Phase 2
HIGH #4: Create fetchAlertsByIds() utility - eliminates 45 lines of duplication
HIGH #6: Add fail-fast bulk error handling - throws on >50% failures, warns on >10%
MEDIUM #8: Extract PIPELINE_LIMITS constants - single source of truth
MEDIUM #9: Verified no emoji in logs (already clean)

Progress: 5/12 deep review issues fixed
Tests: 62/62 passing
Types: No errors
patrykkopycinski added a commit that referenced this pull request Mar 30, 2026
…, extract constants

CRITICAL #2: Delete semantic_dedup_elser.ts (always returned null), document Phase 2
HIGH #4: Create fetchAlertsByIds() utility - eliminates 45 lines of duplication
HIGH #6: Add fail-fast bulk error handling - throws on >50% failures, warns on >10%
MEDIUM #8: Extract PIPELINE_LIMITS constants - single source of truth
MEDIUM #9: Verified no emoji in logs (already clean)

Progress: 5/12 deep review issues fixed
Tests: 62/62 passing
Types: No errors
patrykkopycinski pushed a commit that referenced this pull request Apr 2, 2026
Closes elastic#258318
Closes elastic#258319

## Summary

Adds logic to the alert episodes table to display `.alert_actions`
information.

This includes:
- New action-specific API paths.
- Snooze
  - **Per group hash.**
  - Button in the actions column opens a popover where an `until` can be picked.
  - **When snoozed**
    - A bell shows up in the status column.
    - Mouse over the bell icon to see until when the snooze is in effect.
- Unsnooze
  - **Per group hash.**
  - Clicking the button removes the snooze.
- Ack/Unack
  - **Per episode.**
  - Button in the actions column.
  - When "acked", an icon shows in the status column.
- Tags
  - This PR only handles displaying tags. They need to be created via API.
- Resolve/Unresolve
  - **Per group hash.**
  - The button is always inside the ellipsis menu.
  - The status is turned to `inactive` **regardless of the "real" status.**

<img width="1704" height="672" alt="Screenshot 2026-03-25 at 16 04 12"
src="https://github.com/user-attachments/assets/5ef4111a-6e0c-4114-a60e-ce5f81a86ac6"
/>


## Testing


<details> <summary>POST mock episodes</summary>

```
POST _bulk
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:00:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:01:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "pending" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:02:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:03:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "inactive" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:04:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:05:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:06:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-001", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:07:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:08:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "active" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:09:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "recovering" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:10:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "recovering" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:11:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:12:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "recovering" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:13:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-2", "episode": { "id": "ep-002", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:14:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-003", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:15:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-1", "episode": { "id": "ep-003", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:16:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-4", "episode": { "id": "ep-004", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:17:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-4", "episode": { "id": "ep-004", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:18:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-4", "episode": { "id": "ep-004", "status": "recovering" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:19:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-4", "episode": { "id": "ep-004", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:20:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-5", "episode": { "id": "ep-005", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:21:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-5", "episode": { "id": "ep-005", "status": "pending" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:22:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-5", "episode": { "id": "ep-005", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:23:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-9", "episode": { "id": "ep-006", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:24:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-9", "episode": { "id": "ep-006", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:25:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-9", "episode": { "id": "ep-006", "status": "active" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:26:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-1" }, "group_hash": "gh-9", "episode": { "id": "ep-006", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:14:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-2" }, "group_hash": "gh-7", "episode": { "id": "ep-007", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:15:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-2" }, "group_hash": "gh-7", "episode": { "id": "ep-007", "status": "inactive" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:16:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-3" }, "group_hash": "gh-8", "episode": { "id": "ep-008", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:17:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-3" }, "group_hash": "gh-8", "episode": { "id": "ep-008", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:18:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-3" }, "group_hash": "gh-8", "episode": { "id": "ep-008", "status": "recovering" }, "status": "recovered" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:20:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-4" }, "group_hash": "gh-9", "episode": { "id": "ep-009", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:21:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-4" }, "group_hash": "gh-9", "episode": { "id": "ep-009", "status": "pending" }, "status": "no_data" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:23:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-5" }, "group_hash": "gh-10", "episode": { "id": "ep-010", "status": "pending" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:24:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-5" }, "group_hash": "gh-10", "episode": { "id": "ep-010", "status": "active" }, "status": "breached" }
{ "create": { "_index": ".rule-events" }}
{ "@timestamp": "2026-01-27T16:25:00.000Z", "source": "internal", "type": "alert", "rule": { "id": "rule-5" }, "group_hash": "gh-10", "episode": { "id": "ep-010", "status": "active" }, "status": "no_data" }
```

</details>

- In the POST above, episodes 1 and 3, and episodes 6 and 9 have the
same group hashes.
- Go to `https://localhost:5601/app/observability/alerts-v2` and try all
buttons.
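
To see which episodes a per-group-hash action (snooze, unsnooze, resolve) would touch, the mock events can be grouped by `group_hash`. The sketch below is illustrative only; `RuleEvent` keeps just the two fields it needs and `episodesByGroupHash` is a hypothetical helper, not code from this PR.

```typescript
// Minimal subset of the mock event documents used above.
interface RuleEvent {
  group_hash: string;
  episode: { id: string };
}

// Collects the distinct episode ids seen for each group hash, in first-seen order.
function episodesByGroupHash(events: RuleEvent[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const event of events) {
    const ids = groups.get(event.group_hash) ?? [];
    if (ids.indexOf(event.episode.id) === -1) ids.push(event.episode.id);
    groups.set(event.group_hash, ids);
  }
  return groups;
}
```

Applied to the mock data, `gh-1` maps to `ep-001` and `ep-003`, and `gh-9` maps to `ep-006` and `ep-009`, so snoozing either group hash should be reflected on both of its episodes, while an ack stays on a single episode.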

---------

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>
patrykkopycinski added a commit that referenced this pull request May 18, 2026
…nfirm, trace-based workflow success, B3 unavailable, traceId-array support

Eight pipeline fixes discovered while running v6 alert-analysis
workflow empirics (n=5 traced, Sonnet 4.5). Each addresses a measurement
gap that produced misleading scores in earlier smoke runs.

## 1. HITL auto-confirmation in chat_client

`workflow_execute_step` calls with unsafe step types (e.g.
`kibana.request`) trigger Agent Builder's HITL confirmation primitive
(`prompts.askForConfirmation`). The eval runner never answered those
prompts, so the conversation hung and the tool span never executed.

`chat_client.converse` now scans every `ConverseApiResponse` for pending
prompts, auto-allows them (`{ id, allow: true }`), and re-invokes the
conversation with the answers. Capped at `MAX_AUTO_CONFIRM_ROUNDS = 3`
to defend against pathological loops.

Step deduplication uses a composite `(type, tool_call_id)` key — earlier
`tool_call_id`-only dedupe would drop legitimate `tool_call` steps when
a paired `reasoning` step shared the same id within a single round.
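
A minimal sketch of the auto-confirm loop and the composite dedupe key described above. Every type and function name here (`Step`, `ConverseResponse`, `Converse`, `converseWithAutoConfirm`) is a hypothetical stand-in, not the Agent Builder or `chat_client` API; only `MAX_AUTO_CONFIRM_ROUNDS = 3` and the `{ id, allow: true }` answer shape come from the description.

```typescript
// Hypothetical shapes; the real ConverseApiResponse is richer than this.
interface Step { type: string; tool_call_id: string }
interface PendingPrompt { id: string }
interface ConverseResponse { steps: Step[]; pendingPrompts: PendingPrompt[] }
interface PromptAnswer { id: string; allow: true }

// Stand-in for one call to the conversation endpoint with prompt answers.
type Converse = (answers: PromptAnswer[]) => Promise<ConverseResponse>;

const MAX_AUTO_CONFIRM_ROUNDS = 3;

async function converseWithAutoConfirm(converse: Converse): Promise<Step[]> {
  const seen = new Set<string>();
  const steps: Step[] = [];

  // Dedupe on (type, tool_call_id) so a reasoning step and a tool_call step
  // that share an id are both kept.
  const collect = (incoming: Step[]): void => {
    for (const step of incoming) {
      const key = `${step.type}:${step.tool_call_id}`;
      if (!seen.has(key)) {
        seen.add(key);
        steps.push(step);
      }
    }
  };

  let response = await converse([]);
  collect(response.steps);

  // Auto-allow pending prompts, re-invoking at most MAX_AUTO_CONFIRM_ROUNDS times.
  for (
    let round = 0;
    round < MAX_AUTO_CONFIRM_ROUNDS && response.pendingPrompts.length > 0;
    round++
  ) {
    const answers = response.pendingPrompts.map(
      (prompt): PromptAnswer => ({ id: prompt.id, allow: true })
    );
    response = await converse(answers);
    collect(response.steps);
  }
  return steps;
}
```

With an id-only key, the `tool_call` step in the first round would be dropped whenever a `reasoning` step with the same `tool_call_id` arrived before it, which is the failure mode the composite key avoids.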

## 2. ex=3 expectedSkill: alert-analysis → entity-analytics

Empirically the agent correctly loads the `entity-analytics` skill for
host-risk lookups (security.entity_analytics.risk_score is registered
on that skill), not `alert-analysis`. Updated the metadata; the
trace-based `ExpectedSkillInvocation` evaluator now scores 1.0 on this
example.

## 3. ex=2 expectedSkill removed (Lazarus Group standalone search)

`security.security_labs_search` does not anchor on a specific alert, so
the alert-analysis skill's "When to Use" criteria don't fire. The agent
reasons from base prompts and calls the tool directly, which is correct
behavior. Removed `expectedSkill` so `ExpectedSkillInvocation` short-
circuits to pass (`expectedOnlyToolId` still scores tool correctness).

## 4. ex=3 expectedOnlyToolId: security.entity_risk_score → security.entity_analytics.risk_score

When the `entity-analytics` skill loads, the agent uses the skill's
*inline* risk-score tool (`security.entity_analytics.risk_score`), not
the standalone `security.entity_risk_score`. Updated to match the
actual tool id surfaced in the skill body.

## 5. Workflow_Yaml_Validity, Workflow_PreValidation_PassRate, Workflow_Execution_SuccessRate: return `unavailable` when `r.data` is null

The Agent Builder server does not back-fill `step.results` after a
HITL-prompted tool executes (the result lives in the trace, not the
step record). Earlier B3 evaluators saw `r.data === null` and either
scored a false 1.0 or false 0.0. Switched to `{ label: 'unavailable',
metadata: { reason, workflowCallCount } }` so the dashboard
distinguishes "we didn't observe the call" from "the call failed".

`Workflow_Execution_SuccessRate` now queries the traces-* index for
`Tool: platform.workflows.workflow_execute_step` spans with
`status.code == "Ok"` — bypasses the step-record gap entirely and
measures the actual execution outcome via OTEL.
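
The `unavailable` result can be sketched as follows. The `{ label, metadata: { reason, workflowCallCount } }` shape is taken from the description above; `WorkflowCallResult`, the `valid` field, the reason string, and `scoreYamlValidity` are assumptions for illustration, as is the choice to report `unavailable` only when no call has data.

```typescript
// Hypothetical record of one workflow tool call; `data` is null when the
// server did not back-fill the step result.
interface WorkflowCallResult {
  data: { valid: boolean } | null;
}

type EvalResult =
  | { score: number }
  | { label: 'unavailable'; metadata: { reason: string; workflowCallCount: number } };

function scoreYamlValidity(results: WorkflowCallResult[]): EvalResult {
  let observed = 0;
  let valid = 0;
  for (const r of results) {
    if (r.data === null) continue; // not observed: neither a pass nor a fail
    observed++;
    if (r.data.valid) valid++;
  }

  if (observed === 0) {
    return {
      label: 'unavailable',
      metadata: { reason: 'step results not recorded', workflowCallCount: results.length },
    };
  }
  return { score: valid / observed };
}
```

Returning a label instead of a number keeps unobserved calls out of the averaged score, so they no longer show up as a false 1.0 or a false 0.0.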

## 6. traceId normalized to string[] in custom workflow evaluator + ExpectedSkillInvocation

Same root cause as Fix #8 (framework-side): HITL-resume conversations
produce a `string[]` trace_id. Local `normalizeTraceIds` +
`traceIdInClause` helpers build an `IN (...)` clause so the custom
trace-based queries cover the full conversation, not just the first
round.
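
A rough sketch of what the two helpers might do. The function names come from the description above, but the signatures, the default field name `trace.id`, and the quoting rules are assumptions; the actual query language and field are not shown in this commit.

```typescript
// Accepts the single-string form and the string[] form produced by
// HITL-resume conversations, and returns a de-duplicated list.
function normalizeTraceIds(traceId: string | string[] | undefined): string[] {
  if (traceId === undefined) return [];
  const ids = Array.isArray(traceId) ? traceId : [traceId];
  const unique: string[] = [];
  for (const id of ids) {
    if (id.length > 0 && unique.indexOf(id) === -1) unique.push(id);
  }
  return unique;
}

// Builds `<field> IN ("a", "b")`. The field name is an assumed default.
function traceIdInClause(traceIds: string[], field: string = 'trace.id'): string {
  if (traceIds.length === 0) {
    throw new Error('traceIdInClause requires at least one trace id');
  }
  // Escape backslashes and double quotes so ids cannot break out of the literal.
  const quoted = traceIds.map((id) => `"${id.replace(/["\\]/g, '\\$&')}"`);
  return `${field} IN (${quoted.join(', ')})`;
}
```

For example, `traceIdInClause(normalizeTraceIds(['a', 'b', 'a']))` yields `trace.id IN ("a", "b")`, so a query built from it covers every round of the conversation.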

## 7. Dead evaluator removed

A hardcoded `createSkillInvocationEvaluator({ skillName:
'data-exploration' })` polluted the metrics with a constant 0 for this
suite (which exercises `alert-analysis` and `entity-analytics`, not
`data-exploration`). The flexible `ExpectedSkillInvocation` evaluator
below already runs the same trace query against the per-example
`expectedSkill` metadata, so removed the hardcoded version as a single
source of truth.

## 8. Concurrency 5 → 2 (default cap)

The default `executorClient.runExperiment` concurrency of 5 OOM'd
Kibana on n=5 full runs (Sonnet 4.5 inflates working set per concurrent
conversation). Default lowered to 2; `EVALUATION_CONCURRENCY` env var
overrides for resource-rich envs.
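
One way the default and the override could be resolved is sketched below. `resolveConcurrency` and `DEFAULT_EVALUATION_CONCURRENCY` are hypothetical names, and the fallback behavior for invalid values is an assumption; only the default of 2 and the `EVALUATION_CONCURRENCY` variable come from the description.

```typescript
const DEFAULT_EVALUATION_CONCURRENCY = 2;

// `env` is passed in (e.g. process.env) so the function stays easy to test.
function resolveConcurrency(env: Record<string, string | undefined>): number {
  const raw = env['EVALUATION_CONCURRENCY'];
  if (raw === undefined) return DEFAULT_EVALUATION_CONCURRENCY;

  const parsed = Number.parseInt(raw, 10);
  // Fall back to the default on non-numeric or non-positive values.
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_EVALUATION_CONCURRENCY;
}
```

In this sketch, `resolveConcurrency({ EVALUATION_CONCURRENCY: '5' })` restores the previous level of 5 on a machine with enough memory, and an unset or malformed value keeps the safer default.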

## Empirics — n=5 traced run with all fixes (Sonnet 4.5)

- 525/525 docs persisted, 0 retries, 41.9-min wall clock
- `Skill Invoked`, `ExpectedToolCalled`, `ExpectedWorkflowRequest`,
  `Sequence Accuracy`, `Workflow_Execution_SuccessRate`,
  `Workflow_Yaml_Validity`, `Workflow_PreValidation_PassRate` all 1.0
- Sub-1.0 scores (`ToolUsageOnly` 0.52, `Factuality` 0.20,
  `Relevance` 0.38, `Discovery_First_Pattern_Usage` 0.0) reflect real
  agent behavior + empty test cluster, not eval bugs
patrykkopycinski pushed a commit that referenced this pull request May 25, 2026
## Summary

Set `connect.timeout = 60s` on the undici `Agent` used by
`KbnClientRequester` (https path only).

## Why

elastic#268531 migrated `KbnClient` from axios to native fetch but did not
override undici's 10s `connect.timeout` default. Axios had no equivalent
cutoff, so FTR callers talking to a busy local Kibana started failing
once that PR landed.

The `kibana-streams-performance` weekly pipeline went red in builds #9,
#11, #12, and #13 with:

```
ConnectTimeoutError: Connect Timeout Error (attempted address: localhost:5620, timeout: 10000ms)
```

The `10000ms` is undici's default. Bisect: build #8 last green
(2026-05-11) → #9 first red (2026-05-18), with elastic#268531 in the window.

## What changed


`src/platform/packages/shared/kbn-kbn-client/src/kbn_client/kbn_client_requester.ts`:
one constant, one option on the https `Agent`. http branch unchanged.
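
The shape of the change is roughly the following sketch. `CONNECT_TIMEOUT_MS`, `HttpsAgentOptions`, and `buildHttpsAgentOptions` are hypothetical names, and any other options the real `Agent` receives are omitted; the only detail taken from the description is the 60s `connect.timeout`.

```typescript
// 60s instead of undici's 10s default.
const CONNECT_TIMEOUT_MS = 60_000;

// Minimal subset of the options object accepted by undici's Agent.
interface HttpsAgentOptions {
  connect: { timeout: number };
}

function buildHttpsAgentOptions(): HttpsAgentOptions {
  return { connect: { timeout: CONNECT_TIMEOUT_MS } };
}

// Intended use on the https path (requires the undici package):
//   import { Agent } from 'undici';
//   const dispatcher = new Agent(buildHttpsAgentOptions());
```

The `connect.timeout` value is in milliseconds, which is why the error message above reports the default as `10000ms`.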

## Related

Regression introduced in elastic#268531. Companion streams perf PR: elastic#270636.

## Validation

https://buildkite.com/elastic/kibana-streams-performance/builds/14